Does data splitting improve prediction?
Abstract
Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation, but we use it to estimate the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values, and we judge predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that a split data analysis is preferred to a full data analysis for prediction, with some exceptions.
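To make the three strategies concrete, the following is a minimal sketch and not the authors' procedure. It assumes Gaussian linear regression with nested candidate models (the first k columns of X), AIC as the selection rule, an even split, and a plug-in normal predictive distribution. All function names (`fit_ols`, `aic`, `select_k`, `log_score`, `compare_strategies`) and the simulated data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Maximum likelihood fit of a Gaussian linear model: returns (beta, sigma2)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / len(y)
    return beta, sigma2

def aic(X, y):
    """AIC of the Gaussian linear model (assumed selection rule, not from the paper)."""
    _, sigma2 = fit_ols(X, y)
    n, p = X.shape
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * (p + 1) - 2 * loglik

def select_k(X, y):
    """Choose how many leading columns to keep by minimising AIC."""
    return min(range(1, X.shape[1] + 1), key=lambda k: aic(X[:, :k], y))

def log_score(X_new, y_new, k, beta, sigma2):
    """Mean log predictive density of new observations under the plug-in normal."""
    mu = X_new[:, :k] @ beta
    dens = -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (y_new - mu) ** 2 / sigma2
    return float(np.mean(dens))

def compare_strategies(X, y, X_new, y_new):
    """Log scores for the full data, split data and hybrid strategies."""
    half = len(y) // 2
    X1, y1 = X[:half], y[:half]      # part used for model selection
    X2, y2 = X[half:], y[half:]      # part used for estimation when splitting
    k_full = select_k(X, y)          # full: select on all data
    k_split = select_k(X1, y1)       # split and hybrid: select on part one only
    fits = {
        "full": (k_full, *fit_ols(X[:, :k_full], y)),
        "split": (k_split, *fit_ols(X2[:, :k_split], y2)),
        "hybrid": (k_split, *fit_ols(X[:, :k_split], y)),
    }
    return {name: log_score(X_new, y_new, k, b, s2)
            for name, (k, b, s2) in fits.items()}

# Illustrative simulated data: only the first two columns matter.
rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p)); X[:, 0] = 1.0
X_new = rng.normal(size=(1000, p)); X_new[:, 0] = 1.0
coef = np.array([1.0, 2.0])
y = X[:, :2] @ coef + rng.normal(size=n)
y_new = X_new[:, :2] @ coef + rng.normal(size=1000)
print(compare_strategies(X, y, X_new, y_new))
```

Higher log scores indicate better predictive distributions. A single simulated data set like this one says little about which strategy wins in general; a fair comparison would average the scores over many replications and scenarios.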
Journal:
- Statistics and Computing
Volume 26, issue
Pages -
Publication date 2016